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AWARDS ABSTRACT 

A method and apparatus for supervised neural learning of 
time dependent trajectories exploits the concepts of adjoint 
operators to enable computation of the gradient of an objective 
functional with respect to the various parameters of the network 
architecture in a highly efficient manner. Specifically, it 
combines the advantage of dramatic reductions in computational 
complexity inherent in adjoint methods with the ability to solve 
two adjoint sytems of equations together forward in time. Not 
only is a large amount of computation and storage saved, but the 
handling of real-time applications becomes also possible. The 
invention has been applied it to two examples of representative 
complexity which have recently been analyzed in the open 
literature and demonstrated that a circular trajectory can be 
learned in approximately 200 iterations compared to the 12000 
reported in the literature. A figure eight trajectory was 
achieved in under 500 iterations compared to 20000 previously 
required. The trajectories computed using our new method are 
much closer to the target trajectories than was reported in 
previous studies. 
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BACKGROUND OF THE INVENTION 


Origin of the Invention:. 

The invention described herein was made in the performance of work 
under a NASA contract, and is subject to the provisions of Public Law 
96-517 (35 USC 202) in which the contractor has elected not to retain 

title. 


Microfiche Appendix: 

A computer program embodying the invention is listed in the microfiche 
appendix filed with this specification. The microfiche appendix contains 
material which is subject to copyright protection. The copyright owner 
has no objection to the facsimile reproduction by anyone of the patent 
document or the patent disclosure, as it appears in the Patent and Trade- 
mark Office patent file or records, but otherwise reserves all copyrights 

whatsoever. 

Technical Field: 

The invention relates to methods for training neural networks and m 
particular to neural network training methods using adjoint systems of 
equations corresponding to the forward sensitivity equations of the neu- 
ral network. 


Background Art: 

The following publications represent the state of the art in neural net- 
work training techniques, and are referred to in the specification below 
by author name and year: 


Barhen, J., Toomarian, N. and Gulati, S. (1990a). “ Adjoint operator 
algorithms for faster learning in dynamical neural networks”. In David 
S. Touretzky (Eel.), Advances in Ne ural Information Processing 
Systems. Vol. 2, 498-508, San Mateo, gA-,M or g an 
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Barhen, J., Toomarian, N. and Gulati, S. (1990b). “Application of ad- 
joint operators to neural learning”. Applied Mathematical Letters , 3 (3), 
13-18. 

Cacuci, D.G. (1981). “Sensitivity theory for nonlinear systems” . Journal 
Math. Phys ., 22 (12), 2794-2802. 

Grossberg, S. (1987). The Adaptive brain. Vol. 2, North-Holland. 
Hirsch, M.W. (1989). “Convergent activation dynamics in continuous 
time networks”. Neural Networks , 2 (5), 331-349. 

Maudlin, P.J., Parks, C.V. and Weber C.F. (1980). “Thermal- hydraulic 
differential sensitivity theory”. American Society of Mechanical Engi- 
neering paper WA/HT-56. 

Narendra, Iv.S. and Parthasarat.hy, K. (1990). “Identification and con- 
trol of dynamical systems using neural networks” . IEEE transaction on 
Neural Networks , 1 (1), 4-27. 

Oblow, E.M. (1978). “Sensitivity theory for reactor thermal-hydraulic 
problems”. Nuclear Science and Engineering , 68, 322-357. 

Parlos, A.G., et. al. (1991). “Dynamic learning in recurrent neural 
networks for nonlinear system identification” , preprint 
Pearlmutter, B.A. ( 1989). “Learning state space trajectories in recurrent 
neural networks”. Neural Computation , 1 (2), 263-269. 

Pearlmutter, B.A. (1990). “Dynamic recurrent neural networks”. Tech- 
nical Report CMU-CS-90-196, School of Computer Science, Carnegie 
Mellon University, Pittsburgh, PA. Pineda, F. (1990). “Time dependent 
adaptive neural networks”. In David S. Touretzky (Ed.), Advances 
in Neural Information Processing Systems. Vol. 2, 710-718, San 
Mateo, CA: Morgan Ivaufmann. 

Rumelhart, D.E., Hinton, G.E. and Williams, R.J. (1986). “Learning 
internal representations by error propagation”. In D.E. Rumelhart, J. 
L. McCleland and the PDP Research Group, Parallel Distributed 
Processing: Exploration in the Microstructure of Cognition, 
Vol. 1, Foundations, Cambridge: MIT Press/Bradford Books. 

Sato, M. (1990). “A real time learning algorithm for recurrent analog 
neural networks”. Biological Cybernetics , 62 (3), 237-242. 

Toomarian, N., Wacholder, E., and Kaizerman, S. (1987). Sensitivity 
analysis of two-phase flow problems” . Nuclear Science and Engineering , 
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99 (1), 53-81. Toomarian, N. and Barhen, J. (1991). ’’Adjoint operators 
and non- adiabatic algorithms in neural networks” . Applied Mathematical 
Letters , 4 (2), 69-73. 

Werbos, P.J. (1990). “Backpropagation through time: what it does and 
how to do It”, Proceeding of the IEEE , 87 (10). 

Williams, R.J., and Zipser, D. (1988). “A learning algorithm for con- 
tinually running fully recurrent neural networks” . Technical Report ICS 
Report 8805, UCSD, La Jolla, CA 92093. 

Williams, R.J., and Zipser, D. (1989). “A learning algorithm for contin- 
ually running fully recurrent neural networks”, 

Neural Computation , bf 1 (2), 270-280. 

Zak, M. (1989). “Terminal attractors in neural networks”. Neural Net- 
works , 2 (4), 259-274. 

1. INTRODUCTION 

Recently, there has been a tremendous interest in developing learn- 
ing algorithms capable of modeling time-dependent phenomena (Gross- 
berg, 1987; Hirsh, 1989). In particular, considerable attention has been 
devoted to capturing the dynamics embedded in observed temporal se- 
quences (e.g., Narendra, 1990; Parlos et. al., 1991). 

In general, the neural architectures under consideration may be clas- 
sified into two categories: 

* Feedforward networks, in which back propagation through time (W r er- 
bos, 1990) can be implemented. This architecture has been exten- 
sively analyzed, and is widely used in simple applications due, in 
particular, to the straightforward nature of its formalism. 

* Recurrent networks, also referred to as feedback or fully connected 
networks, which are currently receiving increased attention. A key 
advantage of recurrent networks lies in their ability to use informa- 
tion about past events for current computations. Thus, they can 
provide time-dependent, outputs for both time-dependent as well as 
time-independent inputs. 

One may argue that, for many real world applications, the feedfor- 
ward networks suffice. Furthermore, recurrent network can, m principle, 
be unfolded into a multilayer feedforward network (Rumelhart et al. 
1986). A detailed analysis of the merits and demerits of these two archi- 
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lectures is beyond the scope of this paper. Here, we will focus only on 
recurrent networks. 

The problem of temporal learning can typically be formulated as 
a minimization, over an arbitrary but finite time interval, of an appro- 
priate error functional. The gradients of the functional with respect to 
the various parameters of the neural architecture, e.g., synaptic weights, 
neural gains, etc. are essential elements of the minimization process and, 
in the past, major efforts have been devoted to the efficacy of their com- 
putation. Calculating the gradients of a system’s output with respect to 
different parameters of the system is, in general, of relevance to several 
disciplines. Hence, a variety of methods have been proposed in the liter- 
ature for computing such gradients. A recent survey of techniques which 
have been considered specifically for temporal learning can be found in 
Pear Imut ter (1990). We will briefly mention only those which are rele- 
vant to our work. 

Sato (1990) proposed, at the conceptual level, an algorithm based 
upon Lagrange multipliers. However, his algorithm has not yet been 
validated by numerical simulations, nor has its computational complexity 
been analyzed. Williams and Zipser (1989) presented a scheme in which 
the gradients of an error functional with respect to network parameters 
are calculated by direct differentiation of the neural activation dynamics. 
This approach is computationally very expensive and scales poorly to 
large systems. The inherent advantage of the scheme is the small storage 
capacity required, which scales as 0(N 3 ), where N denotes the size of 

the network. _ 

Pearlmutter (1989), on the other hand, described a variational method| 

which yields a set of linear ordinary differential equations for backpropa- 
gating the error through the system. These equations, however, need to 
be solved backwards in time, and require temporal storage of variables 
from the network activation dynamics, thereby reducing the attractive- 
ness of the algorithm. Recently, Toomarian and Barhen (1991) suggested 
a framework which, in contradistinction to Pearlmutters formalism, en- 
ables the error propagation system of equations to be solved forward in 
time, concomitantly with the neural activation dynamics. A drawback 
of this novel approach came from the fact that their equations had to be 
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analyzed in terms of distributions, which precluded straightforward nu- 
merical implementation. Finally, Pineda (1990) proposed combining the 
existence of disparate time scales with a heuristic gradient computation. 
The underlying adiabatic assumptions and highly approximate gradient 
evaluation technique, however, placed severe limits on the applicability 
of his method. 


SUMMARY OF THE INVENTION 
The invention trains a neural network using a method for calculating 
the gradients of an error functional with respect to the system’s parame- 
ters, which builds upon advances in nonlinear sensitivity theory (Oblow 
1978, Cacuci 1981). In particular, it exploits the concept of adjoint oper- 
ators to reduce the computational costs. Two novel systems of equations 
for error propagation (i.e., the adjoint equations ), are at the heart of the 
computational framework. These equations are solved simultaneously 
(i.e., forward in time) with the network dynamics. The computational 
complexity of the algorithm scales as 0{N 3 ) per time step, the storage 
requirements are minimal i.e., of the order of 0(iV 2 ), while complication 
arising from the presence of distributions in our earlier framwork are 
avoided. In the remainder of this specification, the terms sensitivity and 
gradient will be used interchangeably. 


The method of the invention trains a neural network so that a neuron 
output state vector thereof obeys a set of forward sensitivity equations 
over a finite repeatable learning period, the method including setting 
the neuron output state vector to zero at the beginning of the learning 
period, defining first, and auxiliary adjoint systems of equations govern- 
ing an adjoint function and an auxiliary adjoint function, respectively, 
of the neural network and corresponding to the forward sensitivity equa- 
tions, setting the adjoint function to zero at the beginning of the learning 
period and integrating the adjoint system of equations forward in time 
over the learning period to produce a first term of an indirect effect of a 
sensitivity gradient of the neural network, setting the auxiliary adjoint 
function to zero at the end of the learning period and integrating the 
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auxiliary adjoint system of equations forward in time over the learning 
period to produce a remaining term of the indirect effect, computing a 
sum of the first and remaining terms, multiplying the sum by a learning 
rate and subtracting the product thereof from a current neuron param- 
eter vector to produce an updated neuron parameter vector. 


BRIEF DESCRIPTION OF THE DRAWINGS 

Fig. 1 is a diagram of a recurrent neural network employed in car- 
ring out the invention. 

Fig. 2 is a block diagram illustrating architecture which performs 
the process embodying the present invention for training in neural net- 
work by integration of adjoint systems of equations forward in time. 

Fig.’s 3, 4 and 5 illustrate different simulation results of a neural 
network learning a circular motion using the invention. 

Fig. 6 is a graph of the error as a function of the number of learning 
iterations for each of the cases illustrated in Fig.’s 3-5. 


Fig.’s 7, 8 and 9 illustrate different simulation results of a neural 
network learning a figure-eight motion using the invention. 

Fig. 10 is a graph of the error as a function of the number of learn- 
ing iterations for each of the cases illustrated in Fig. s /'-9. 


DETAILED DESCRIPTION OF THE INVENTION 

2. TEMPORAL LEARNING FRAMEWORK 

We formalize a neural network as an adaptive dynamical system 
whose temporal evolution is governed by the following set of coupled 
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nonlinear differential equations: 

il n + K n U n = gnllnC^j Tnm u m + M] t > 0 (1) 


where u n represents the output of the nth neuron (u n (0) being the ini- 
tial state), and T nm denotes the strength of the synaptic coupling from 
the m-th to the n-th neuron. The constants characterize the decay 
of neuron activities. The sigmoidal functions g n (-) modulate the neu- 
ral responses, with gain given by q n ; typically, g n {lnx) = tanh(7 n x). 
In order to implement a nonlinear functional mapping from an Nj- di- 
mensional input space to an N 0 - dimensional output space, the neural 
network is topographically partitioned into three mutually exclusive re- 
gions. As shown in Figure 1, the partition refers to a set of input neurons 
5/, a set of output neurons So, and a set of "hidden” neurons S H • Note 
that this architecture is not formulated in terms of "layers”, and that 
each neuron may be connected to all others including itself. 

Let a(t) (in the sequel the overhead bar will denote a vector) be an N- 
dimensional vector of target temporal patterns, with non zero elements, 
a n (t ), in the input and output sets only. When trajectories, rather than 
mappings, are considered, as we shall see in the sequel, components m 
the input set may also vanish. Hence, the time-dependent external input 
term in Eq. (1), i.e., I n (i), encodes component-contribution of the target 
temporal pattern via the expression 


im = N - (,) 


( 2 ) 


if n E Si 

0 if n E Sh U So 

To proceed formally with the development of a temporal learning 
algorithm, we consider an approach based upon the minimization of an 
error functional, E, defined over the learning period or time interval 


[toJf] by the following expression 

ni f i 


E(u,p ) = f 1 \ E e " dt = l Fdt 

J to n ° 


( 3 ) 


where the error component, e„(1), represents the difference between the 
desired and actual value of the output neurons, i.e., 

... / a„(f) - u„(t) if" £ s ° 

Ml = | o 


if n E Si U Sh 


( 4 ) 
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In our model, the internal dynamical parameters of interest are the 
strengths of the synaptic interconnections, T nm , the characteristic decay 
constants, K n , and the gain parameters, y n . They can be represented as 
a neuron parameter vector of M [M = N 2 + 2iV] components 

p = { Tn, • • • ,T;\';v, K\, • • • ,k.jv, 7i, ■ ■ • ilN } (^) 

We will assume that elements of p are statistically independent. Further- 
more, we will also assume that, for a specific choice of parameters and 
set of initial conditions, a unique solution of Eq. (1) exists. Hence, the 
state variables u (which may be referred to as the neuron output state 
vector) are an implicit function of the parameters p. In the rest of this 
paper, we will denote the p ih element of the vector p by p„ (where the 

neuron parameter index p = 1, • • ■ » M)- 

Traditionally, learning algorithms are constructed by invoking Lya- 
punov stability arguments, i.e., by requiring that the error functional be 
monotonically decreasing during learning time, r . This translates into 

dE _ dE dPv_ < q ( 6 ) 

dr ^ dpu clr 


One can always choose, with a learning rate r) > 0 


dp u dE 

-t-i- = —77- 

dr dp^ 


(7) 


which implements learning in terms of an inherently local minimization 
procedure. Attention should be paid to the fact that Eqs. (1) and (7) 
may operate on different time scales (i.e., the neural network behavior 
time * of Equation 1 and the neural adaptation or learning time r of 
Equations 6 and 7), with parameter adaptation occurring at a slower 
pace. Integrating the dynamical system, Eq.(7), over the interval [r, r + 
At], one obtains, 


p fl {T + At) = p fl (T) - i] 


r+Ar dE_ 
dPn 


dr 


( 8 ) 
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Equation (8) implies that, in order to update a system parameter 
one must evaluate the “ sensitivity ” (i.e., the gradient) of E , Eq. (3), 
with respect to p ^ in the interval [r, r + At]. Furthermore, using Eq. 
(3) and observing that the time integral and derivative with respect to 
commute, one can write 


dpn J to dp M J to dpn J to du dp^ 


(9) 


This sensitivity expression has two parts. The first term in the Right 
Hand Side (RHS) of Eq.(9) is called the “direct effect”, and 

corresponds to the explicit dependence of the error functional on the 
system parameters. The second term in the RHS of Eq. (9) is referred 
to as the “indirect effect”, and corresponds to the implicit relationship 


between the error functional and the system parameters via u. In our 
learning formalism, the error functional, as defined by Eq. (3), does not 
depend explicitly on the system parameters; therefore, the “direct effect 
vanishes, i.e., 


dF 

d/V 


0 


(10a) 


Since F is known analytically (viz. Eqs. (3) and (4)), computation of 
dF/du is straightforward. Indeed 


dF 
d a u 




(10 b) 


Thus, to enable evaluation of the error gradient using Eq. (9), the “in- 
direct effect” matrix du /dp should, in principle, be computed. In the 
sequel, we shall see that this is rather expensive from an algorithmic (i.e., 
computational complexity) perspective, but that an attractive alterna- 
tive, based on the concept of adjoint operators, exists. First, however, 
we introduce the notion of teacher forcing. 


3. TEACHER FORCING A novel neural network ” teacher forcing” 
training method is described in co-pending patent application Serial No. 
07/908,677 filed June 29, 1992 by the inventors herein and entitled ”Fast 
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Temporal Neural Learning Using Teacher Forcing . As indicated in Fig- 
ure 2c of the above-referenced co-pending application, the parameters 
of the network are updated based upon the error accumulated over the 
length of the trajectory or learning period. This method is employed 
in numerical simulations of the present invention described in detail be- 
low. The present invention may be combined with such teacher forcing if 
desired, or the present invention may be practised without teacher forc- 
ing. In order to incorporate this teacher forcing into the neural learning 
formalism presented earlier, the time-dependent input to the neural ac- 
tivation dynamics, Eq.(l), i.e., I n (t) as given by Eq. (2), is modified to 
read: 

f a n (t) if n G 5/ 

I n (t) = < 0 if n £ Sh (11) 

{ - U n(t)] d if n € s o 

At this stage, A and (3 are assumed to be positive constants. The purpose 
of the term [a n (f)] 1-/3 is to insure that I n {t ) has the same dimension as 
a n (t) and u n {t). Zak (1989) has demonstrated that in general, for (3 = 

(2z'+l)/(2j+l), i < j and i andj strictly positive integers, an expression 
of the form [a n — u n induces a terminal attractor phenomenon for the 
dynamics described be Eq. (1). Generally, (3 = if 9 for the numerical 
simulations reported below in this specification. 

When learning is successfully completed, [i.e., e n {t ) = 0] teacher 
forcing will vanish, and the network will revert to the conventional dy- 
namics given by Eqs. (1) and (2). As described in the above-referenced 
co-pending application, we suggest that A be modulated in time as a 
function of the error functional, according to 

A(r) - 1 - (12) 

The above expression should be understood as indicating that, while 
A varies on the learning time scale, it remains at essentially constant 
levels during the iterative passes over the interval [GU/]* 


4. GRADIENT COMPUTATION ALGORITHMS 
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The efficient computation of system response sensitivities (e.g., er- 
ror functional gradients) with respect to all parameters of a network’s 
architecture plays a critically important role in neural learning. In this 
section, we will first review tw r o of the best currently available meth- 
ods for computing these sensitivities, including an explicit approach for 
calculating the matrix du/dp , and an alternative approach, based upon 
the concept of adjoint operators, which involves error back propagation 
through time. This will be followed by the details of a new method, which 
enables an efficient computation of the sensitivities by solving two sys- 
tems of adjoint equations forward in time. 

4.1 State-of-the-art Methods 

4.1.1 DIRECT APPROACH 

Let us differentiate the activation dynamics, Eq. (1), including the 
teacher forcing, Eq. (11), with respect to p ft . We observe that the time 
derivative and partial derivative with respect to p ^ commute. Using the 
shorthand notation d{ • • -)/dp ft = (• • •),/* we obtain a set of equations to 
be referred to in the sequal as “Forward Sensitivity Equations (FSEs). 

/ W „ + YLm Um,n = S n )/( t > 0 H3j 

\ U n ,n =0 * = 0 

in which 

A nm = (K n - In 9n dl n /du n )6nm ~ 7 n 9n Tnm U^) 

S n ,(i = -U n 6 Pu , Kw +I'n9h'^2 u m &p u .T„ m + Qn ( ™ + In)&P M ,7n ( 15 ) 

m m 

In the above expressions. g' n represents the derivative of g n with respect 
to its arguments, S denotes the Kronecker symbol and S n ^ L is defined 
as a nonliomogeneous “source’' . The source term contains all explicit 
derivatives of the neural activation dynamics, Eq. (1), with respect to 
the system parameters, p^ . Hence, it is parameter dependent and its size 
is (N x M). The initial conditions of the activation dynamics, Eq.(l), 
are excluded from the vector of system parameters p. Thus, the initial 
conditions of the FSEs will be taken as zero. Thier solution will provide 


11 


the matrix du/dp needed for computing the “indirect effect” contribution 
to the sensitivity of the error functional, as specified by Eq. (9). This 
algorithm is, essentially, similar to the scheme proposed by Williams and 
Zipser (1989). Computation of the gradients using the forward sensitivity 
formalism would require solving Eq. (13) M times, since the source 
term, S n explicitly depends on p fl . This system has N equations, 
each of which requires multiplication and summation over N neurons. 
Hence, the computational complexity, measured in terms of multiply- 
accumulates, scales like N 2 per system parameter, per time step. Let us 
assume, furthermore, that the interval [^o^/] is discretized into L time 
steps. Then, the total number of multiply-accumulates scales like N 4 L. 
Clearly, such a scheme exhibits expensive scaling properties, and would 
not be very practical for large networks. On the other hand, since the 
FSEs are solved forward in time, along with the neural dynamics, the 
method also has inherent advantages. In particular, there is no need for 
a large amount of memory. Since u n ,n has A 7 3 + 2A“ components, the 
storage requirement scales as 0(A 3 ). 

4.1.2 INDIRECT APPROACH 

In order to reduce the computational complexity associated with 
the above technique for evaluating the “indirect effect” term in Eq.(9), 
an alternative approach can be considered. It is based upon the con- 
cept of adjoint operators, and eliminates the need to compute explicitly 
u v . The critical feature of this approach is that a single vector of 
adjoint functions, v , is obtained, by solving an N-dimensional system of 
equations once, not M times as previously. This vector contains all the 
information required for computing the sensitivities of the error func- 
tional, (IE l dp y,, with respect to all parameters, p M . The necessary and 
sufficient conditions for constructing adjoint equations are discussed in 
Cacuci ( 1981 ). Adjoint equations can be derived in a number of manners, 
including variational, perturbation theoretic, differential and functional 
analytic techniques. Details of derivations, based upon the differential 
approach, for example, can be found in Toomarian (198/) and Maudlin 
et al. (1980). 

It can be shown that an Adjoint System of Equations (ASEs) per- 
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taining to the FSEs, Eq. (13), can be formally written as 

-V n + Y Vm = f > 0 ^ 

m 

where the superscript “T” denotes transposition of the ASE matrix A nm , 
and the indices n and m range from 1 to N. In order to specify Eq. (16) 
in closed mathematical form , we must define the source term, S*, and 
the initial conditions for the system. (An initial condition may apply to 
the beginning of the learning period or the end of the learning period.) 
Both should be independent of and its derivatives. As we will see in 
the sequel, a judicious choice of the source term, 5*, and of the initial 
conditions for the ASEs, Eq. (16), forms the cornerstone of the indirect 
methods. Generally, the source term, S* . is selected m such a way, 
as to make its inner product with u 4l identical to the “indirect effect 
contributions to the sensitivities of the error functional. Selection of the 
time (initial or final) at which the initial conditions are specified will, on 
the other hand, dictate the direction in time in which the ASEs should 
be integrated. For instance, if the set of initial condition is specified at 
t oi the ASEs will be integrated forward in time. On the other hand, if the 
initial conditions are given at //, the ASEs must be integrated backward 
in time. In the remainder of section 4.1, we will derive and discuss the 
advantages and disadvantages of two algorithms, which integrate the 
ASEs backward and forward in time, respectively. 


a - Integration of the ASEs backward in time 


In order to construct an expression which can be used to eliminate 
the dependence of the “indirect effect" term of the sensitivities on 
we have to form the inner product of the vectors 5* and u 4 , . This is done 
by multiplying the FSEs, Eq. (13). by P T , and the ASEs, Eq. (16), by 
u T r and by subtracting the two resulting equations. Then, integrating 
over the time interval [t 0 , t /], yields the bilinear form: 




S,„ ) dt 



(17) 
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In the above equation denotes the inhomogeneous source term of the 
FSEs, Eq. (15), and [...]* denotes the value of the expression in brackets 

evaluated at time t. By identifying 

5* = dF/du n (18°) 


the second integral in the RHS of Eq. (17) is seen to become identical to 
the “indirect effect” term in Eq. (9). Incorporating the initial conditions 
of Eq. (13) into Eq. (1/), we obtain: 


'\ d -ly 

' du> 


w , jji dt 


Jto 


V T S,ndt - [ 


-T - 
v u 






(19) 


How can we eliminate the undesirable presence of from the RHS of 
Eq. (19)? The clear choice is 


m = t f ) = o 


(186) 


resulting in the fundamental identity 



' dF 

du 




v T S„dt 


( 20 ) 


Since the ASEs, (Eqs. 16, and 18 ). and (which is known by defini- 
tion) are independent of , the RHS of Eq. (20) is independent of 
Hence, by solving the ASEs once only, the “indirect effect is evaluated 
via the RHS of Eq. (20) for all system parameters, p. Thus, the above 
identity provides the basis for a computationally efficient algorithm for 
evaluating the gradient of the error functional. It is important to notice 
that since the initial conditions, Eq. (18b), for the ASEs, were given at 
trajectory end time, i.e., at t = t f , Eq. (16), is integrated backward in 
time, i.e., from t = if to t = t 0 . 

We note that the above paradigm results in equations similar to 
those obtained by Pearlmutter (1989) using a variational approach. Both 
algorithms require that the neural activation dynamics, Eq. (1), be first 
solved forward in time (to provide quantities such as A . nm , S ^ ) 
followed by a solution of the ASEs, Ecp (16), integrated backward in 
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time. During the backward integration, the product of the vector v and 
matrix S M is calculated and also integrated from / = tf to t = t Q . The 
result of this integration provides the ‘‘indirect effect” contribution to 
the gradients of the error functional. 

Since u n (t ) and v n (t ) can be obtained at the cost of N multiply- 
accumulat.es per time step, and there are N equations (neurons) per 
network, the complexity of this approach scales like N 2 L. Thus, the 
principal advantage of adjoint methods lies in the dramatic reduction 
of computational costs (e.g., at least 0(N~) for an A -neuron network). 
However, a major drawback to date has come from the necessity to store 
quantities such as g 1 , S* and at each time step. Hence, the memory 
requirements for this method scale as A ~L. Note also that the actual 
evaluation of the RHS of Eq. (9) requires N~L mult.iply-accumulates, 
independent whether the FSEs or ASEs are used. 

b - Integration of the ASEs forward in time 

Is it possible to overcome the rather severe memory limitations of 
the adjoint method, while keeping its computational benefits? We notice 
that Eq. (16) is linear in the variables v. Therefore, it is, in principle, 
possible to obtain identical contributions to Eq. (9) with an alternative 
choice for adjoint source and initial conditions. Indeed, let us select: 



o u 

~ tf) 

(21 a) 

and 





rt\ 

II 

O 

II 

o 


(216) 

where 6(t — i t 

-) is the Dirac distribution. 

Inserting Eqs. 

(21) into Eq. 

(17) and recalling that, by definition of 6 




/ v6{t-tf)dt = 

Jtn 

[v T u mli ] t=tf 

(22) 


we obtain 


This expression is identical to Eq. (20). Therefore, most items discussed 
in connection with this equation will hold. However, v is now the solution 
of the ASEs defined by Eqs. (16) and (21). In contradistinction to the 
previous approach, the initial conditions for the ASEs are given hereat 
t = to, Eq. (21b). Therefore, the ASEs can be integrated forward in 
time, concomitantly with the neural activation dynamics. Potentially, 
storage requirements are reduced by a large amount, since the quantities 
S* and g' are now immediately used at each time step to calculate v and 
its product with the known matrix S 4l . Hence, the memory required 
scales only as 0(N 2 ). The computational complexity of this method 
also scales as 0(N 2 L). A potential drawback, however, lies in the fact 
that Eq. (16) now contains, from eq. (21a), the Dirac distribution, 
S(t _ t f ), operating directly on the adjoint functions, v. This precludes 
straightforward (stable) numerical or VLSI implementation of such a 
scheme. 

4.2 The New Approach 

At this stage, we introduce a new paradigm which will enable us to 
evolve the adjoint dynamics, Eq. (16) forward in time, but without the 
difficulties associated with the above formalism. Consider, again, the 
bilinear form, Eq. (17), associated with the FSEs, Eq. (13), and the 
ASEs, Eq.(16). Let us select 


This is similar to the choice made earlier, when discussing the integration 
of the ASEs backward in time. The second integral in the RHS of Eq. 
(17) will again be identical to the “indirect effect contribution to the 
sensitivity of the error functional, Eq. (9). But, in contradistinction to 
Eq. (18b), we now select as initial conditions: 


v(t = 0) = 0. (246) 

This will allow us to integrate the resulting ASEs, i.e., Eqs. (16) and 
(24), forward in time. Combining Eqs. (9), (17) and (24), we obtain the 
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following expression: 


[ f (^) T u dt = f f (S*) T u 4 i dt = f ( v) T S^dt - [v T uj tf (25) 
J io du ’ J io Jt 0 

The first term in the RHS of Eq. (25) can be computed by using the 
values of v resulting from the solution of Eqs. (16) and (24) forward in 
time. The main difficulty resides with the evaluation of the second term 
in the RHS of Eq. (25), i.e., [v T u^) if . To compute it, we now introduce 
an auxiliary adjoint system, formally similar to Eq. (16): 


-z n + A «» 


= Sn 


t > 0 


(26) 


m 


Once again, the inner product of u and S is required. The bilinear 
form associated with u ^ and r is obtained in a similar fashion to the 
derivation of Eq. (17). Its expression is: 





]// - [- T u n>] 


/' v .( 


to 




S) dt 


(27) 


By choosing 


5 = v{t)8(t-tf) (28a) 

the last term of Eq. (27) reduced to [v T u 4l ] tf , which is the quantity of 
interest. If we furthermore select 


z(tf) = 0. 

and take into account the initial value of u )#1 , Eq. (27) yields 


(28 b) 



S)dt = [v T 



/'• 

Jio 


(5 r 5,u )dt 


(29) 


Note that, even though we selected ;(//) = 0, we are still interested in 
solving the auxiliary adjoint system, Eqs. (26) and (28), forward in time. 


Thus, the critical issue is how to select initial condition (i.e., z(t 0 )), that 
would result in z(tf) = 0. 

Let us first provide a simple qualitative illustration of how this prob- 
lem can be approached. We assume, for the moment, that the matrix A 
in Eq. (26) is time independent. The formal solution of Eqs. (26) and 
(28) can then be written as: 

z(t) = e AT{i - to) z(t 0 ) t < tf ( 30a ) 

z(t f ) = e AT{if ~ io) z{t 0 ) - v{t f ) (30b) 

In principle, using Eq. (30a), the RHS of Eq. (29) can be expressed in 
terms of z(t Q ) . At time t f , where v(t f ) is known from the solution of 
Eqs. (16) and (24), one can calculate the vector z(t 0 ), from Eq. (30b), 
with z(tf) = 0. 

In the problem under consideration, however, the ASE matrix A 
in Eq. (26) is time dependent, (viz Eq. (14)). Thus, it. is practical to 
integrate numerically the auxiliary adjoint equations. Usually, the same 
numerical scheme used for Eqs. (1) and (16) should be adopted. For 
illustrative purposes only, we limit the discussion in the sequel to a first 
order finite differences approximation, i.e., we rewrite Eqs. (26,28) as 

_ ( ~ /+1 ~ ll + \A T }'z l =0 0 < l < L (31) 

A/ 1 

where, superscript 1 implies that the numerical values of the quantities of 
interest at time step l are used and A/ denotes the size of the integration 
time step. Here, [A T f denotes the ASE matrix evaluated at the time 
step l. From the above equation one can easily show that 

A + 1 = B l • B l ~ l ■ ■■ B l • B 0 z(t o ) = B n z(t 0 ) (32) 

where 

B l = I + At [A T ] 1 (l = 0, 1, • • • , L — 1) (33) 

and I denotes the identity matrix. The term B l may be referred to as 
a propagation kernel. Using a numerical approximation for the integral, 
the RHS of Eq. (29) can be recast as 


IS 



[v T sj„ = z T (to) [^[B T ] (, ' 1) '5.'J 

l 


(34) 


Algorithmically, the computation of [v T u^]t f can be described as 
follows. At each time step /, the values of the matrices B ^ and 5 ^, 
are calculated using Eqs. (14, 15 and 33). The needed value of B^ 1 1 ' l ‘ 
is computed by multiplying the stored value of B^ 1 by B ^ The 


result is not only stored for the next iteration, but also used to calculate 
the product of by which, in turn, is added up. The initial 

conditions z(t 0 ) can easily be found at time £/, i.e., at iteration step L, 
by solving the system of linear algebraic equations: 


B {L - l)[ :(t 0 ) = v{t f ) 


(35) 


To summarize, in this new method the computation of the gradients 
of the error functional (i.e., Eq. (9)). involves two stages, corresponding 
to the two terms in the RHS of Eq. (25). The first term is calculated 
from the adjoint functions, v, obtained by integrating Eqs. (16) and (24) 
forward in time along with the neural activation dynamics, Eq. (1). The 
corresponding computational complexity is O ( JV 2 L ) . The second term 
is calculated via Eqs. (34-35), and involves two steps: a) kernel propa- 
gation, Eq. (34), which requires multiplication of two matrices B l and 
j3 (Z_1)! at each time step; hence, its computational complexity scales as 
N 3 L ; b) solution of the linear algebraic system (35) with computational 
complexity of 0(A “). Thus, the overall computational complexity of 
this approach is 0(J\ 3 L ). Notice, however, that here the stoiage needed 
is minimal and equal to 0(A 2 ). 

An architecture for performing the foregoing process is illustrated 
in the block diagram of Figure 2. Each block in Figure 2 may be thought 
of as a separate dedicated processor, although the entire process is im- 
plementable with a single processor using the program of Appendix A. 
Referring to Figure 2, the first step is to set the source term to the partial 
derivative of the error functional F with respect to the neuron output 
state vector and to set the adjoint function at the beginning of the learn- 
ing period to zero (block 200). Then, the adjoint system of equations 
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(Equation 16) is integrated forward in time (block 205) to produce the 
first of two components of the indirect effect of the sensitivity gradient of 
equation (9) (block 210 of Figure 2) and the adjoint function evaluated 
at the end of the learning period (block 215). The next process includes 
blocks 220 through 265 of Figure 2 and corresponds to an integration of 
the auxiliary adjoint system of equations (Equation 26) forward in time 
and is performed contemporaneously with the integration step of block 
205. In block 220, the integration of the auxiliary ASE’s over the learn- 
ing period is divided into L time steps and the time step index l is set 
to zero (block 225). At each time step, the propagation kernel B is com- 
puted (block 230) and the time step index is incremented (block 235). 
The kernels are propagated at each time step by multiplying the current 
kernel by the product of the kernels of all previous time steps at block 
240 and multiplied by the derivative of the source term with respect to 
neuron parameters (block 245) and the result is summed over all time 
steps (block 250). The result of block 240 corresponding to the final time 
step is taken in block 250 for use in solving the system of Equation (35). 
This solution is indicated in Figure 2 as a multiplication (block 260) of 
the inverse of the propagation kernel of block 255 by the adjoint function 
at the end of the learning period (of block 215). However, it is under- 
stood that well-known iterative methods may be employed in solving 
the system of Equation 35 rather than inversing the propagation kernel. 
Finally, the results of blocks 250 and 260 are multiplied together (block 
265) to produce the remaining component of the indirect effect of the 
sensitivity gradient of Equation (19). This remaining component (i.e., 
the product of block 265) is added at block 270 to the first component 
of the indirect effect (from block 210), the resulting sum is multiplied by 
the learning rate (block 275) and subtracted from the current parameter 
vector (block 280) to produce an updated parameter vector (block 285). 
Then, the time t is reset and the learning time r is incremented (block 
290) and the entire process repeated. 

As a final remark, we wish to consider a further approximation, 
based upon the requirement of small A/. The matrices B\ Eq. (33), 
become then diagonally dominant, and B 1 ' can be approximated at each 
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time step l by 


( 36 ) 


B 1 ' = I + Af Y1 A ‘ 

l 

This implies that the computational complexity of the proposed method 
could be further reduced to O ( A 2 L ) for certain class of trajectories. 
At this stage such an approximation is merely an idea which has to be 
sustained via simulations. 

5. NUMERICAL SIMULATIONS 

The new learning paradigm, presented in the preceding section, has 
been applied to the problem of learning two trajectoiies. a circle and 
a figure eight. Results referring to these problems can be found in the 
literature (Pearlmutter 1989), and they offer sufficient complexity for 
illustrating the computational efficiency of our proposed formalism. 

The network that was trained to produce these trajectories involved 
6 fully connected neurons, with no input, 4 hidden and 2 output units. 
An additional “bias” neuron was also included. In our simulations, the 
dynamical systems were integrated using a first, order finite difference 
approximation. The sigmoidal nonlinearity was modeled by a hyperbolic 
tangent. Throughout, the decay constants s . n . the neural gains 7 n , and A 
were set to one. Furthermore, )3 was selected to be 7/9. For the learning 
dynamics, At was set to 6.3 and // to 0.015873. The two output units 
were required to oscillate according to 


a, 5 (/) = A smut 

(36a) 

«(;(/) = A COS u A 

(366) 

for the circular trajectory, and, according to 


az{t) = .4 sin ut 

(37a) 

a a (t) = Asin'2ujt 

(376) 


for the figure eight trajectory. Furthermore, we took A = 0.5 and u = 1. 
Initial conditions were defined at t 0 = 0. Plotting 05 versus Oq produces 
the “desired” trajectory. Since the period of the above oscillations is 
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27T, if — 2n time units are needed to cover one cycle. We selected 
A t = 0.1, to cover one cycle in approximately 63 time steps. 


5.1 Circular Trajectory 

In order to determine the capability and effectiveness of the algo- 
rithm, three cases were examined. As initial conditions, the values of u n 
were assumed to be uniform random numbers between -0.01 and 0.01 for 
the simulation studies referred in the sequel as “Case - 1 and “Case - 
2”. For Case - 3, we set u n equal to zero, except u G which was set to 0.5. 
The synaptic interconnections w^ere initialized to uniform random values 
between —0.1 and +0.1 for all three experiments. 

CASE - 1. 

The training was performed over ij = 6.5 time units( i.e., 65 time 
intervals). A maximum number of 500 iterations was allowed. The re- 
sults shown in Fig. 3 were obtained by starting the network with the 
same initial conditions, w n (0), as used for training, the learned values of 
the synaptic interconnections, T n m , and with no teacher forcing (A = 0). 
As we can see, it takes about 2 cycles until the network reaches a consis- 
tent trajectory. Despite the fact that the system’s output was plotted for 
more than 15 cycles, only the first 2 cycles can be distinguished. Figure 
6 demonstrates that most of the learning occured during the first 300 
iterations. 

CASE - 2. 

Here, we decided to increase the length of the trajectory gradually. 
A maximum number of 800 learning iterations was now allow r ed. The 
length of the training trajectory was 65 time intervals for the first 100 
iterations, and increased every 100 iterations by 10 time intervals. There- 
fore, it was expected that the error functional would increase whenever 
the length of the trajectory was increased. This was indeed observed, as 
may be seen from the learning graph, shown in Fig. 6. The output of 
the trained network is illustrated in Fig. 4. Here again, from 15 recall 
cycles, only the first two (needed to reach the steady orbit) are distin- 
guishable and the rest overlap. Training using greater trajectory lengths 
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yielded a recall circle much closer to the desired one than m the previous 
case. From Fig. 6, one can see that the last 500 iterations did not en- 
hance dramatically the performance of the network. Thus, for practical 
purposes, one may stop the training after the first 300 iterations. 

CASE - 3. 

The selection of appropriate initial conditions for u n plays an im- 
portant role in the effectiveness of the learning. Here, all initial values of 
u n were selected to be exactly zero except the last unit, where u 6 = 0.5 
was chosen. This correspond to an initial point on the circle. The length 
of the trajectory was increased successively, as in the previous case. In 
spite of the fact that we allowed the system to perform up to 800 itera- 
tions, the learning was essentially completed in about 200 iterations, as 
shown in Fig. 6. The results of the network's recall are presented m Fig. 
5, which shows an excellent match. 

5.2 Figure Eight Trajectory 

For this problem, the synaptic interconnections were initialized to 
uniform random values between -1 and +1. As initial conditions, the 
values of u n were assumed to be uniform random numbers between -0.01 
and 0.01. The following three situations were examined. 

CASE - 4- 

The training was performed over ij = 6.5 time units( i.e., 65 time 
intervals). A maximum number of 1000 iterations was allowed. The re- 
sults shown in Fig. 7 were obtained by starting the network with the 
same initial conditions, t* n (0), as used for training, the learned values of 
the synaptic interconnections, T„ m . and with no teacher forcing (A - 0). 
As we can see, it takes about 3 cycles until the network reaches a con- 
sistent trajectory. Despite the fact that the system’s output was plotted 
for more than 15 cycles, only the first 3 cycles can be distinguished. 

CASE -5. 

Here, we again decided to increase the length of the trajectory gra 
ually. A maximum number of 1000 iterations was now allowed. The 
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length of the training trajectory was 65 time intervals for the first 100 
iterations, and was increased every 100 iterations by 5 time intervals. 
Therefore, it was again expected that the objective functional would in- 
crease whenever the length of the trajectory was increased. This was 
indeed observed, as may be seen from the learning graph, shown in Fig. 
10. The output of the trained network is illustrated in Fig. 8. Here 
again, from 15 recall cycles, only the first three (needed to reach the 
steady orbit) are distinguishable, and the rest overlap. As a direct result 
of training using greater trajectory lengths, orbits much closer to the 
desired one than in the previous case weie obtained. 


CASE -6. 

The learning in this case was performed under conditions similar to 
CASE - 5, but with the distinction that A was modulated according to 
Eq. (12). The results of the network's recall are presented in Fig. 9, and 
demonstrate a dramatic improvement with respect to the previous two 

cases. 

It is important to keep in mind the following observations with re- 
gard to the simulation results: 

1) For the circular trajectory, A was kept constant throughout the 
simulations and not modulated according to Eq. (12). As we can see 
from Fig. 6, in cases 1 and 2, the error functional was not reduced to 
zero. Hence, a discrepancy in the functional form of the neural activation 
dynamics used during the learning and recall stages occurred. This was 
a probable cause for the poor performance of the network. In case 3, 
however, the error functional was reduced to zero. Therefore, the teacher 
forcing effect vanished by tlie end of the learning. 

2) For the figure eight trajectory, the differences between cases 5 and 
6 lies in the modulation of A, (i.e., the amplitude of the teacher forcing). 
Even though in both cases the error functional was reduced to a negli- 
gible level, the effect of the teacher forcing in case 5 was not completely 
eliminated over the entire length of the trajectory. This points toward 
the fact that modulation of A not only reduces the number of iterations 
but also provides higher quality results. 
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While the invention has been described detail by specific reference 
to preferred embodiments, it is understood that variations and modifi- 
cations thereof may be made without departing from the true spirit and 
scope of the invention. 
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NEURAL NETWORK TRAINING BY INTEGRATION 
OF ADJOINT SYSTEMS OF EQUATIONS FORWARD IN TIME 

ABSTRACT OF THE INVENTION 
A method and apparatus for supervised neural learning of time de- 
pendent trajectories exploits the concepts of adjoint operators to enable 
computation of the gradient of an objective functional with respect to 
the various parameters of the network architecture in a highly efficient 
manner. Specifically, it combines the advantage of dramatic reductions 
in computational complexity inherent in adjoint methods with the abil- 
ity to solve two adjoint sytems of equations together forward in time. 
Not only is a large amount of computation and storage saved, but the 
handling of real-time applications becomes also possible. The invention 
has been applied it to two examples of representative complexity which 
have recently been analyzed in the open literature and demonstrated 
that a circular trajectory can be learned in approximately 200 iterations 
compared to the 12000 reported in the literature. A figure eight trajec- 
tory was achieved in under 500 iterations compared to 20000 previously 
required. The trajectories computed using our new method are much 
closer to the target trajectories than was reported in previous studies. 
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